Abstract

We show that the default-all propagation scheme for 
database annotations is dangerous. Dangerous here 
means that it can propagate annotations to the query out- 
put which are semantically irrelevant to the query the 
user asked. This is the result of considering all relationally
equivalent queries and returning the union of their
where-provenance in an attempt to define a propagation 
scheme that is insensitive to query rewriting. 

We propose an alternative query-rewrite-insensitive 
(QRI) where-provenance called minimal propagation. 
It is analogous to the minimal witness basis for why- 
provenance, straight-forward to evaluate, and returns all 
relevant and only relevant annotations. 



1 Query-Rewrite-Insensitive provenance 

Provenance is sensitive to query rewriting unless care- 
fully defined. Sensitive here means that the returned 
provenance may be different for a relationally equivalent 
query (we focus exclusively on conjunctive queries under 
set semantics). This is surprising at first since we are ac- 
customed to leaving it to the database engine to choose 
the simplest relationally equivalent query to return our 
results. If we also consider provenance, then we are not 
guaranteed to get the provenance output we expect. 

With this argumentation, Buneman et al. ^ proposed 
that it is important to find a clean semantics for prove- 
nance that guarantees to give the same result for relation- 
ally equivalent queries. At least two well-known query- 
rewrite-insensitive (QRI) versions have been defined: 
Buneman et al. [2] defined the minimal witness basis
for why-provenance, and Bhagwat et al. [1] defined the
default-all propagation scheme for where-provenance.

Our goal with this paper is to point to some semantic 
problems with the way the QRI property is achieved by 
default-all propagation. We also show how to fix these 
problems with an alternative propagation scheme. 
Figure 1: Particular definitions (naive, standard, QRI) for
why- and where-provenance considered in this paper.



Due to space constraints and in order to keep this paper
to the point, we will assume basic familiarity of the
reader with the provenance concepts given in Fig. 1 and
not repeat their formal definitions. Instead, we refer to
the detailed survey of Cheney et al. [4] from which we
also borrow the running example of Fig. 2 and Fig. 4
(and the milk example after giving a real-world interpretation
to the annotations). Appendix A summarizes the
notation used throughout this paper.

2 The minimal witness basis 
as QRI why-provenance 

Why-provenance identifies witness tuples: "What input
tuples contribute to the presence of each output tuple?"
A witness is a subset of the input tuples that is sufficient to
ensure that a given output tuple $t$ appears in the result of a
query. This definition implies that the whole database is
a witness, as it is sufficient for $t$ to appear. The witness basis
or why-provenance $\alpha_w(t,Q)$ is a subset of only relevant
witnesses, where the definition by Buneman et al. [2]
makes precise what "relevant" means. Intuitively, those
tuples that have been involved in some operation during
query evaluation are part of the witness basis. It turns out
that why-provenance is not QRI, and relationally equivalent
queries may have different witness bases.

Buneman et al. [2] showed that a subset of the witness
basis, called the minimal witness basis and written
here as $\alpha_w^m(t,Q)$, is invariant under rewriting. It consists
of all the minimal witnesses in the witness basis,
where a witness is minimal if none of its proper subsets
is also a witness. For example, the why-provenance of $t_4$
in $Q_2$ in Fig. 2c is $\alpha_w(t_4,Q_2) = \{\{t_1\},\{t_1,t_2\}\}$; however,
$\alpha_w^m(t_4,Q_2) = \{\{t_1\}\}$ in Fig. 2d since $\{t_1\} \subset \{t_1,t_2\}$, and
thus $\{t_1,t_2\}$ is not minimal.

Figure 2: (a): Input table $R$. (b,c): Equivalent queries $Q_1(x,y) \mathrel{{:}{-}} R(x,y)$ and $Q_2(x,y) \mathrel{{:}{-}} R(x,y), R(x,\_)$ together with the
why-provenance $\alpha_w$ of their tuples. (d,e): Lineage $\alpha_l$ and minimal witness basis $\alpha_w^m$ of the tuples for $Q_2$.

Lineage $\alpha_l(t,Q)$ for an output tuple $t$ is a subset of
the input tuples which are relevant to the output tuple,
where the definition by Cui and Widom [5] makes precise
what "relevant" means. Intuitively, we can get the
lineage by taking the union over all witnesses in the why-provenance.
We write this as $\alpha_l(t,Q) = \bigcup \alpha_w(t,Q)$. For
example, $\alpha_l(t_4,Q_2) = \bigcup \alpha_w(t_4,Q_2) = \bigcup \{\{t_1\},\{t_1,t_2\}\} = \{t_1,t_2\}$
in Fig. 2e.
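The two set operations used in this section can be sketched in a few lines of Python. This is only an illustration under the assumption that a witness basis is given as a collection of sets of tuple identifiers; the function names are ours and not taken from [2] or [5].

```python
def minimal_witness_basis(witness_basis):
    """Keep only the witnesses that have no proper subset among the witnesses."""
    witnesses = {frozenset(w) for w in witness_basis}
    return {w for w in witnesses if not any(v < w for v in witnesses)}


def lineage(witness_basis):
    """Union of all witnesses in the why-provenance."""
    return frozenset().union(*(frozenset(w) for w in witness_basis))


# Why-provenance of output tuple t4 in Q2, as quoted in the text.
why_t4 = [{"t1"}, {"t1", "t2"}]

print(minimal_witness_basis(why_t4))  # the single minimal witness {t1}
print(lineage(why_t4))                # the tuples t1 and t2
```

The proper-subset test `v < w` is what discards the witness {t1, t2}: it contains the smaller witness {t1}.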



3 Default-all propagation 
as QRI where-provenance 

Where-provenance focuses on cells $(t,A)$, i.e. tuples $t$ together
with an attribute $A$, and identifies witness cells:
"Where (from what cell) does an output tuple value come
from?" Hence, where-provenance of a cell $(t,A)$ consists
of locations or values that can be found in tuples of the
why-provenance of $t$. Since where-provenance was investigated
in the context of propagating annotations from
input to output cells [3], we write $\alpha_p(t,A,Q)$ for where-provenance
(cp. Fig. 1).

Where-provenance is also not the same for equivalent
queries, and there are two distinct issues to consider: (1)
The first has to do with the way we write a conjunctive
query in SQL (thus called "SQL interpretation" in
Fig. 1) and is illustrated with Fig. 3. Query $Q_3'$ selects
attribute $A$ from table $R$, whereas $Q_3''$ selects it from table
$S$. Hence, a naive interpretation of propagation through
SQL would lead to propagated values $\alpha_p^*(t_3,A,Q_3') = \{a,c\}$
versus $\alpha_p^*(t_3,A,Q_3'') = \{g\}$. This problem disappears once
we consider Datalog notation, and is taken care of by
the definition of propagation rules in [3], which propagate
annotations from attributes of both joined tables.

(2) Secondly, certain relational rewrites do not preserve
annotation propagation. Figure 4 gives a detailed
example taken from [4] that shows that relationally
equivalent queries $Q_1$ and $Q_2$ result in different annotations
of their output (cp. Fig. 4b vs. Fig. 4c).

(d) $Q_3'$

SELECT distinct R.A, S.C
FROM R,S
WHERE S.C = 2

(e) $Q_3''$

SELECT distinct S.A, S.C
FROM R,S
WHERE S.C = 2

Figure 3: A naive "SQL interpretation" of query
$Q_3(x,y) \mathrel{{:}{-}} R^a(x,y), S^a(x,\text{'2'})$ would lead to different where-provenances
for cell $(t_3,A)$ in the output depending on
whether SQL queries $Q_3'$ or $Q_3''$ were used.

In an attempt to define a QRI propagation scheme
for where-provenance, Bhagwat et al. [1] define
the default-all propagation scheme, written here as
$\alpha_p^d(t,A,Q)$. Their system DBNotes achieves QRI for
where-provenance by including the provenance of all relationally
equivalent rewrites $Q'$ for a query $Q$:

$$\alpha_p^d(t,A,Q) = \bigcup_{Q' \equiv Q} \alpha_p(t,A,Q')$$



For example, Fig. 4d shows the result annotations for
both equivalent queries $Q_1$ and $Q_2$ over the input table of
Fig. 4a in the default-all propagation scheme. Intuitively,
for both $Q_1$ and $Q_2$, default-all propagation returns the
where-provenance of the relationally equivalent query

$$Q(x,y) \mathrel{{:}{-}} R^a(x,y), R^a(\_,y), R^a(x,\_).$$
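To make the effect of the extra self-joins concrete, the following Python sketch evaluates the three conjunctive queries over a small annotated table and collects, for each output cell, the annotations of all input cells bound to the corresponding head variable. The table contents are an assumption (the figure with the actual table is not reproduced in this text); they were chosen to agree with the values quoted in the running example. The evaluator `where_provenance` is a hypothetical helper, not the DBNotes implementation.

```python
from itertools import product

# Assumed annotated table R^a: each cell is a (value, annotation) pair.
R_A = [
    ((1, "a"), (2, "b")),  # t1
    ((1, "c"), (3, "d")),  # t2
    ((4, "e"), (2, "f")),  # t3
]


def where_provenance(head, body, table):
    """Evaluate head :- body over `table` under set semantics.

    Each body atom is a tuple of terms: a variable name (str) or None for '_'.
    Returns {output tuple: [set of annotations for each output column]}.
    """
    result = {}
    for rows in product(table, repeat=len(body)):
        binding, ok = {}, True
        for atom, row in zip(body, rows):
            for term, (value, _) in zip(atom, row):
                if term is not None and binding.setdefault(term, value) != value:
                    ok = False  # the same variable was bound to two values
        if not ok:
            continue
        out = tuple(binding[v] for v in head)
        cells = result.setdefault(out, [set() for _ in head])
        # Propagate annotations from every body cell bound to a head variable.
        for atom, row in zip(body, rows):
            for term, (_, annotation) in zip(atom, row):
                if term in head:
                    cells[head.index(term)].add(annotation)
    return result


HEAD = ["x", "y"]
Q1 = [("x", "y")]                            # Q1(x,y) :- R(x,y)
Q2 = [("x", "y"), ("x", None)]               # Q2(x,y) :- R(x,y), R(x,_)
Q_ALL = [("x", "y"), (None, "y"), ("x", None)]  # R(x,y), R(_,y), R(x,_)

for name, body in [("Q1", Q1), ("Q2", Q2), ("Q_ALL", Q_ALL)]:
    print(name, where_provenance(HEAD, body, R_A)[(1, 2)])
```

Under these assumptions, the output tuple (1, 2) carries one annotation per cell for Q1, gains a second annotation on the first column for Q2, and gains a further annotation on the second column for the query with both self-joins.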

4 Default-all propagation is dangerous! 

We next illustrate that the QRI property of default-all 
comes at a high price, namely the price of propagating 
irrelevant tuples to the output. This can be dangerous. 

Example 1 (Milk). Hanako lives in Tokyo and 
worries about the recent nuclear accidents at the 
Fukushima nuclear power plant. She likes to drink 
Figure 4: (a): Annotated table $R^a$. (b,c): Equivalent queries $Q_1(x,y) \mathrel{{:}{-}} R^a(x,y)$ and $Q_2(x,y) \mathrel{{:}{-}} R^a(x,y), R^a(x,\_)$ with the
where-provenance $\alpha_p$ of their cells. (d,e): QRI variants default-all propagation $\alpha_p^d$ and minimal propagation $\alpha_p^m$.
(c) Annotation b



user: Bob 

date: March 18, 2011, 8:43pm 

I have just measured half a glass of milk with my Geiger
counter. I found five times the allowed amounts of Iodine-131
and Cesium-137. I will make a second measurement
tomorrow to confirm.



(d) Annotation f



user: Fuyumi 

date: March 19, 2011, 7:25am 

I measured 250ml bought yesterday and today, and I can 
assure you I found only small, negligible traces. 



Figure 5: (a): Database for Example 1. Note that table $R^a$
is semantically the same as $R^a$ in Fig. 4a taken from [4]. (b):
The query is $Q_4(y) \mathrel{{:}{-}} R^a(\text{'LFMilk'},y)$, i.e. "find all annotations
for LF Milk." (c,d): Content of annotations b and f.



lactose-free milk, but has just heard that traces of radioactive
Cesium-137 were found in LF Milk of the local
store. She is worried, and not without reason.
She queries a community database (Fig. 5a) for the content
of LF Milk. The database includes data and user-generated
annotations. She wants to make sure that she
gets all relevant information and therefore opts for the
default-all propagation scheme of user-generated community
annotations (she is not familiar with databases
and provenance, but "default-all" just sounds like the
right thing to do). The database returns Fig. 5b with two
annotations: b and f, shown in Fig. 5c and Fig. 5d.

Based on the annotations the database returns, she decides
to buy and drink the milk. Fuyumi is a very reputable
friend of hers, and Fuyumi claims in the most
recent annotation f that her measurements show only
low levels of radiation. However, what Hanako does
not realize (and what the database does not expose
to her) is that Fuyumi's comment has nothing to do
with LF Milk. The comment propagated to the output
because the database included annotations from
all relationally equivalent queries. One such query
is $Q_4'(y) \mathrel{{:}{-}} R^a(\text{'LFMilk'},y), R^a(\_,y)$, which is responsible
for propagating to the output an annotation about
Cesium-137 in SC Water, a completely different product.

Basically, the default-all propagation scheme has 
given Hanako semantically irrelevant annotations, based 
on which she then made the wrong decision. 

5 Non-dangerous QRI where-provenance 

Why is default-all propagation dangerous? The reason is 
a mismatch in the semantics. Just because two different 
tuples have the same value in an attribute does not imply 
that the annotations of those attribute values are related in 
any way. And, whereas rewriting the query Qi into query 
Q2 with an additional (and unnecessary) self-join on ta- 
ble R does not change the output tuples, we now have a 
join with semantically irrelevant tuples that propagates 
irrelevant information. And since the first step of making
the scheme QRI (that of avoiding the issue in Fig. 3)
propagates annotations from all cells that join, default-all
propagation will make sure that completely irrelevant
annotations propagate to the query output.

We propose instead the minimal propagation scheme.
Intuitively, for a given output cell $(t,A)$, we intersect the
where-provenance $\alpha_p(t,A,Q)$ with the annotations in the
minimal witness basis $\alpha_w^m(t,Q)$ on all attributes $A'$
contributing to $\alpha_p$. Written differently:

$$\alpha_p^m(t,A,Q) := \bigcup_{\substack{t' \in \bigcup \alpha_w^m(t,Q) \\ A' \in \text{attributes of } t' \text{ propagating to cell } (t,A)}} \alpha_p(t',A',R^a(t'))$$

Here, the expression $\bigcup \alpha_w^m(t,Q)$ transforms the minimal
witness basis as sets of sets of tuples into a set of tuples
(hence, it can be interpreted as a form of QRI lineage).
The overall expression unions, over all tuples $t'$
in the minimal witness basis, the annotations $\alpha_p$ from
all attributes $A'$ of input table $R^a(t')$ from which tuple $t'$
propagated values to the cell $(t,A)$. Since those attributes
(b) QRI where-provenance for cell $(t_4,A)$ in Fig. 4:
minimal propagation $\alpha_p^m$ in green vs. default-all $\alpha_p^d$ in red.

Figure 6: (a) The minimal witness basis $\alpha_w^m$ considers only
minimal sets of witnesses that imply the output. (b) In contrast,
default-all propagation $\alpha_p^d$ considers the union of annotations
for all equivalent queries. We propose instead
minimal propagation $\alpha_p^m$ as QRI where-provenance analogous
to the minimal witness basis $\alpha_w^m$ for why-provenance.



are never changed by rewriting a query into an equivalent 
query, the output is well-defined and QRI. 

The minimal propagation scheme has the following 
desirable properties: 

(i) Just like default-all propagation, it is also QRI. 

(ii) There is no need to evaluate any rewrite of a given
query.¹

(iii) Just as the minimal witness basis for why-provenance,
it considers a minimal and QRI set of
values (see Fig. 6). The intuition is that, among all
relationally equivalent queries, those that have no
irrelevant self-join (cp. $Q_1$ vs. $Q_2$ in Fig. 4) are the
ones that most closely capture the user's actually intended
semantics of the query.

For our running example, Fig. 4d and Fig. 4e compare
the output of default-all with that of minimal propagation.
For example, both where-provenance and default-all
propagation return $\{a,c\}$ for output cell $(t_4,A)$ in
query $Q_2$. In contrast, minimal propagation returns $\{a\}$, because
$t_1$ from $R^a$ is the only tuple in the minimal witness
basis ($\bigcup \alpha_w^m(t_4,Q_2) = \{t_1\}$) with one contributing attribute
$A$. Hence, $\alpha_p^m(t_4,A,Q_2) = \alpha_p(t_1,A,R^a) = \{a\}$.



In our milk example (Example 1), minimal propagation
gives the only relevant annotation b.
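One possible reading of minimal propagation as a procedure is sketched below in Python: compute the witnesses of each output tuple, keep the minimal ones, and propagate annotations only along derivations whose witness is minimal. The annotated table is an assumption chosen to agree with the values quoted for the running example, and `derivations` and `minimal_propagation` are hypothetical helper names.

```python
from itertools import product

# Assumed annotated table R^a, keyed by tuple id; cells are (value, annotation).
R_A = {
    "t1": ((1, "a"), (2, "b")),
    "t2": ((1, "c"), (3, "d")),
    "t3": ((4, "e"), (2, "f")),
}


def derivations(head, body, table):
    """Yield (output tuple, input tuple ids matched by each body atom)."""
    for ids in product(table, repeat=len(body)):
        binding, ok = {}, True
        for atom, tid in zip(body, ids):
            for term, (value, _) in zip(atom, table[tid]):
                if term is not None and binding.setdefault(term, value) != value:
                    ok = False
        if ok:
            yield tuple(binding[v] for v in head), ids


def minimal_propagation(head, body, table):
    """Propagate annotations only along derivations with a minimal witness."""
    derived = {}
    for out, ids in derivations(head, body, table):
        derived.setdefault(out, []).append(ids)
    result = {}
    for out, all_ids in derived.items():
        witnesses = {frozenset(ids) for ids in all_ids}
        minimal = {w for w in witnesses if not any(v < w for v in witnesses)}
        cells = [set() for _ in head]
        for ids in all_ids:
            if frozenset(ids) not in minimal:
                continue  # skip derivations that use unnecessary tuples
            for atom, tid in zip(body, ids):
                for term, (_, annotation) in zip(atom, table[tid]):
                    if term in head:
                        cells[head.index(term)].add(annotation)
        result[out] = cells
    return result


HEAD = ["x", "y"]
Q1 = [("x", "y")]
Q2 = [("x", "y"), ("x", None)]
Q_ALL = [("x", "y"), (None, "y"), ("x", None)]

for body in (Q1, Q2, Q_ALL):
    print(minimal_propagation(HEAD, body, R_A)[(1, 2)])
```

On this assumed table, all three equivalent formulations yield the same annotations for every output cell, and no rewrite of the given query has to be enumerated.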



¹ Bhagwat et al. [1] provide an optimization that avoids having to
evaluate infinitely many equivalent formulations for default-all, and it
suffices to evaluate only a finite number.



6 Conclusions 

Arguably, the QRI (query-rewrite-insensitive) property 
of annotation propagation is desirable. We do not discuss 
here whether this is indeed the case, but merely point out 
that, if aiming for QRI, care has to be taken not to trade a 
meaningful semantics in exchange for this property. 

We illustrated that the default-all propagation scheme 
achieves QRI by including annotations from relationally 
equivalent but somehow irrelevant rewrites. This can 
lead to spurious annotations in the output which are se- 
mantically irrelevant, and thus can give the user a wrong 
impression of relevance. Hence, default-all is dangerous. 

We proposed minimal propagation which is QRI, has 
a clean and simple semantics, and propagates all relevant 
and only relevant annotations to the output. 
A Notation



$t_i$                  input or output tuple
$R, S$                 input tables | sets of tuples
$A, B, C$              attributes of a table
$Q_i$                  queries or output tables
$\alpha_w(t,Q)$        why-provenance (witness basis) for tuple $t$ of table $Q$ | if context is known, also used as $\alpha_w(Q)$ or $\alpha_w(t)$ | set of sets of tuples
$\alpha_w^m(\,)$       minimal witness basis
$\alpha_l(t,Q)$        lineage of tuple $t$ in table $Q$ | set of tuples
$R^a, S^a$             annotated input tables
$\alpha_p(t,A,Q)$      where-provenance (propagation) for the value of cell $(t,A)$ of table $Q$ | if context is known, also used as $\alpha_p(t,A)$ or $\alpha_p(Q^a)$ | set of values
$\alpha_p^d(\,)$       default-all propagation
$\alpha_p^m(\,)$       minimal propagation